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Abstract: 

'N--  The  purpose  of  this  work  is  to  study  the  solution  of  &  2-D  compressible  viscous  fluid  flow  in  a 
nozzle  on  a  Hypercube. 

The  analysis  of  the  physical  problem  on  a  serial  computer  and  an  outline  of  its  parallelization 
on  the  hypercube  were  detailed  ia-fWf. 

We  propose  now  in  this  paper^td  detail  all  the  steps  of  the  parallelization  and  our  results: 
speedups,  efficiencies  and  communication  time  versus  arithmetic  time.  r  ..  4  .  ", 
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1.  Introduction 
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The  purpose  of  this  work  is  to  study  the  solution  of  a  2-D  compressible  fluid  flow  on  a  Hyper¬ 
cube,  and  to  see  the  improvement  in  the  computational  time  got  with  such  a  distributed  memory 
message-passing  machine. 

In  the  realization  of  this  project,  the  chosen  methodology  was  first  to  solve  the  physical  problem 
by  the  implementation  of  a  serial  code,  then  to  study  the  parallelization  of  this  code  on  a  Hypercube 
[10]. 

The  solution  of  the  fluid  flow  in  a  convergent-divergent  nozzle  is  a  problem  wich  has  been 
studied  by  several  authors  on  sequential  [3,  4,  10,  12,  20]  and  vector  [19]  computers  and  we  propose 
now  to  study  the  parallelization  of  this  problem  on  a  machine  such  as  the  iPSC  Intel  Hypercube. 

A  general  technique  of  solution  of  the  Navier-Stokes  equations  [17,  18]  mainly  based  on  the 
combiration  of  a  general  curvilinear  coordinate  transformation  aqd  An  implicit  method  has  given 
some  good  results  and  is  used  here. 

More  precisely,  the  Navier-Stokes  equations  are  solved  with  the  Beam- Warming  implicit  me 
-  thod  [l],  which  leads  to  the  computation  of  two  sets  of  linear  systems  alternatively  solved  (ADI 
method) . 

The  implementation  of  this  numerical  simulation  on  the  Hypercube  is  based  on  a  block  method 
that  is  to  say  that  the  computational  domain  is  decomposed  in  strips  for  both  the  x  and  y-directions. 
Two  differents  schemes  of  communications  among  the  processors  are  implemented  and  gave  some 
good  results  concerning  the  efficiency  of  the  parallel  code. 

In  the  first  chapters,  we  introduce  the  physical  problem  which  lead  to  the  computation  of  a 
serial  code,  then  we  analyze  and  discuss  the  results  of  the  code  implemented  on  the  Hypercube. 

2.  The  model  equations  * 

2.1.  Transformed  Navier-Stokes  equations 

The  motion  inside  the  nozzle  (see  figure  1)  is  governed  by  the  unsteady  averaged  Navier-Stokes 
equations  in  the  inertial  cartesian  coordinates  (x,y,t).  If  we  take  for  the  dependent  variables,  the 
cartesian  velocity  components,  these  basic  set  of  equations  can  be  transformed  to  a  new  body-fitted 
curvilinear  coordinate  system  (£,  »7,r),  while  retaining  the  strong  conservation  law  form  [9,  17,  18, 
20,  22].  The  space  (£,rj,r)  called  computational  space  corresponds  to  a  rectangular  domain  [21]. 
Due  to  the  geometrical  properties  of  the  nozzle,  we  can  restrain  the  transformation  to  [19]: 

(£  =  fto.n  =  l(x,y),r  =  t). 
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The  resulting  equations,  written  in  the  non-dimensional  form  and  in  the  strong  conservation  law 
form,  are  given  by  [9,  17,  18]: 


dfq  +  d(e  +  dnf  =  Re-1[di(J-1{Ug 1))  +  U9i  +  (1) 


where: 
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with: 


rxx  =  (A  +  2/i)u*  +  Avy 

r*v  =  rv*  =  m(uv  +  vx) 

•  Tyy  =  (A  +  2  fl)vv  +  Au, 

9u  =  ur„  +  Wzy  +  kPr~l (7  -  i)-ldxa? 
g24  =  ur„  +  vryy  +  kPr "*(7  -  l)-13yaJ 

and: 


U  =  £xu  ,  V  =  rjxu  +  Tjvv  (2) 

The  velocities  U  and  V  are  called  the  contravariant  velocity  components,  and  correspond  to 
the  decomposition  of  the  vector  velocity  along  the  £  and  q  curvilinear  coordinates. 

J  is  the  transformation  Jacobian: 


J  =  £sriy.=  l/foy,,) 


(3) 


The  cartesian  derivatives  such  as  ux  are  expanded  in  the  (£,17)  space  via  chain-rule  relation: 


Uz  =  £x«{  +  r)z  Urj  (4) 

And,  the  metrics  £z,r?x,f7y  formed  from  chain-rule  extension  of  x^,y^,yn  are  given  by  the  fol¬ 
lowing  relations: 


Zt  =  Jyr,  >1*  = ’Vv- Jxt  (5) 

and: 

p :  density. 

u,v:  cartesian  velocity  components. 
en:  total  energy. 
p:  pressure. 

a:  sound  speed  (a  =  y/'jRT). 

T:  temperature. 

7:  ratio  of  specific  heats  (dry  air:  7  =  1.4). 
p:  dynamic  viscosity, 

A:  taken  with  the  Stokes  hypothesis  A  =  —  (2/3)p.  (6) 

k:  coefficient  of  thermal  conductivity. 

R:  gas  constant. 

U:  modulus  of  the  velocity  (U  =  \fu2  +  v2). 

D:  specific  length. 
cp:  specific  heat. 

Re:  Reynolds  number  (Re  =  ( UDp)/p ). 

Pr:  Prandtl  number  ( Pr  =  ( pcp/k ). 

M:  Mach  number  (M  =  Ufa ). 

The  equations  (1)  are  completed  by  the  equation  defining  the  pressure: 
p  =  (7  -  l)[en  -  0.5p(u2  +  v2)]  (7) 

As  our  nozzle  is  symmetric  about  the  central  axis  y=0,  we  can  restrict  our  study  to  the  upper 
side  of  the  nozzle. 
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Figure  1:  45°  —  15°  Convergent-Divergent  Nozzle 
2.2.  Simplified  form:  the  parabolic  equations 

Classically  for  high  Reynolds  number  flows,  we  solve  the  viscous  terms  only  near  the  rigid 
boundaries.  We  also  make  the  hypothesis  of  the  parabolic  approximation  that  the  viscous  terms  in 
£  (along  the  body)  are  neglected  and  only  the  viscous  terms  in  rj  are  retained. 

The  equations  (1)  are  then  equal  to  (9,  17,  20]: 


where: 


dTq  +  d(e  +  d„f  =  Rc-ldng 


(8) 


9  =  J~l 


(  0  \ 

M(ffz  +  >!*)«*  +  (M/3)tjx(rjxu,j  +  rjyv,,) 

M  +  V  JK  +  (/i/3)r?„(r7,ti„  +  tjvvn) 

[kPr-'i 7  -  l)_1(q*  +  r}2)d,,a2  +  n(rj2  +  t)2)(u2  +  v2)n/2 
V  +n/6(rj2u2  +  rt2yv2  +  2r7Ir?„(ut/)fJ)] 


(9) 


Note  that  unlike  boundary- layer  theory,  the  pressure  p  can  vary  through  the  viscous  layer,  and 
all  the  inertial  terms  of  the  normal  momentum  equation  are  retained. 
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3.  Boundary  and  Initial  conditions 

3.1.  Boundary  conditions 

The  equations  (8)  have  to  be  solved  in  the  computational  domain  subject  to  different  conditions 
on  the  boundaries. 

The  curvilinear  coordinate  system  (£,»?)  is  defined  such  that  the  inflow  and  outflow  boundaries 
are  two  ^-coordinate  lines,  and  the  wall  and  the  symmetric  axis  are  two  ^-coordinate  lines. 

At  the  nozzle  wall,  we  have  for  the  velocity  the  no-slip  condition:  u  =  v  =  0  and,  for  the 
temperature  we  can  take  the  adiabatic  condition:  n  •  VT  =  0,  with  n  the  vector  normal  to  the 
surface. 

Because  of  the  symmetry  of  the  nozzle,  we  consider  the  axis  y=0  as  one  of  the  boundaries  of 
this  problem  (see  figure  l).  In  order  to  limit  the  overloading  of  the  code,  we  just  limit  ourselves  to 
a  condition  on  the  vertical  component  of  the  velocity  v  :  v  =  0  as  it  is  an  odd  function  of  y. 

At  the  upsteam  inflow  boundary,  the  pressure  p  and  the  temperature  T  are  constant  during 
all  the  computation,  and  are  equal  to  their  stagnation  correspond  ants  that  is  :  p  =  p0  and,  T  =  T0. 

The  components  u  and  v  of  the  velocity  are  given  by  the  relation:  u  =  vtanfJ(y)  where  0(y)  is 
a  function  given  by  Holder  et  al  [5], 

And  finally,  as  the  downstream  outflow  boundary  corresponds  to  the  case  of  a  supersonic  flow 
(M  >  1),  no  boundary  conditions  are  required.  The  variables  are  determined  by  extrapolation 
from  the  interior  points. 

3.2.  Initial  condition 

All  the  variables  at  the  beginning  of  the  computation  were  computed  under  the  1-D  approxi¬ 
mation  method  (isentropic  flow,  perfect  fluid). 

In  such  an  approach,  the  values  of  the  variables  at  a  specified  section  of  the  nozzle  are  given  by 
the  geometric  ratio  of  the  aera  of  the  section  over  the  throat  section,  and  by  the  chosen  stagnation 
conditions  (6,  16].  In  particular,  we  have  taken  here:  p0  =  6.2  105 N/m2  and  T0  =  300 K . 

The  Reynolds  number  for  this  initial  condition  was  then  equal  to:  Re  =  1.5  x  10s. 


4.  Numerical  Method 


We  solve  the  equations  (8)  in  the  computational  domain  (£,r?)  subject  to  the  boundary  con¬ 
ditions  described  in  section  3.1,  with  the  implicit  delta- form,  approximate-factorization,  Beam- 
Warming  algorithm. 

An  implicit  numerical  method  is  employed  here  in  order  to  avoid  the  severely  restrictive  stabil¬ 
ity  conditions  of  an  explicit  method,  when  small  grid  spacings  are  used.  Such  a  situation  is  needed 
near  the  wall  for  an  accurate  computation  of  the  viscous  effects. 

In  the  basic  Beam- Warming  algorithm,  the  spatial  derivative  terms  in  the  Navier-Stokes  equa¬ 
tions  are  approximated  by  standard  second-order  accurate  central-difference  operators,  and  the 
implicit  0-method  of  Richtmyer  and  Morton  is  taken  for  the  time  differencing  [9], 

The  computation  of  the  boundary  points  can  be  computed  directly  by  the  numerical  resolution 
of  the  equations  (see  [20])  or  by  extrapolation  from  the  interior  points  at  the  end  of  each  time  step 
(see  [17]).  The  second  case  degrades  the  time  accuracy  on  the  boundary  to  a  zero-order  but  gives 
a  more  simple  scheme.  We  have  chosen  here  the  second  case. 


4.1.  Time  Differencing 

The  equations  of  Navier-Stokes  in  their  final  form  (8)  are  rewritten  here  as: 

dTq=-[d(e  +  dr,f-Re-1dng}  =  f  (10) 

Using  n  for  the  time  index  and  h  for  the  time  step,  we  apply  the  Richtmyer  and  Morton 
method,  and  we  obtain  [9]: 

qn+1  =  q»  +  h[{l  -  0)fn+1  +  9rn\  (11) 

This  method  is  first-order  accurate  in  time  for  9  =  0  (Implicit  Euler  method),  and  is  second- 
order  accurate  in  time  for  9  =  1/2. 

Since  we  seek  only  the  asymptotic  steady  state  solution,  we  can  employ  the  first-order  accurate  in 
time  method.  The  accuracy  of  the  solution  is  given  by  the  spatial  difference  operators  [  1  ] . 


gn+1  _  -n  +  hfn+l 


(12) 


$ 

Ml 


K 


In  order  to  define  the  non-linear  term  rn+1,  we  must  locally  linearized  the  terms  e,  /  and  g  in 
terms  of  q.  This  is  done  by  using  the  Taylor  series  expansions: 


en+l  =  en  +  En{qn+l  -  qn)  +  Q{h2). 


fn+ 1  =  fn  +  Fn(qn+1  -  qn)  +  0(h2). 


gn+ 1  =  gn  +  Gn(gn+1  -  qn)  +  0(h2). 


where  E,  F  and  G  are  the  flux  Jacobian  matrices: 


E  =  de/dq,  F  =  df/dq  ,  G  =  dg/dq 


The  matrices  E  or  F  are  given  by  [9,  17]: 


E,F  = 


where: 


Ki 


K2 


0  \ 


(  0 

Ki4>2  -  u9  0  -  Ki{ 7  -  2)u  Kju  -  {‘j  —  1) K iv  K\(i  —  1) 
K2<f>2-v9  Kiv  -  K2(i  -  l)u  9-K2{i-2)v  #2(7-1) 

V  9(2<f> 2  -  ~t(en/p))  [K\fi  -  (7  -  l)u0]  [#20  -  (7  ~  l)v0]  7#  ^ 


<£2  =  0.5(7  ~  l)(u2  +  v2)  ,  /?  =  ~f(en/p)  -  4> 2  and,  0  =  #iu  +  #2»> 


with  the  following  definition  of  the  coefficients  K\  and  K2: 


for  E:  K\  =  £x,  K2  =  0  and,  for  F:  K\  =  r\t ,  #2  = 


And  for  the  viscous  flux  Jacobian  term,  we  have  [9,  17]: 


(13-a) 

(13.6) 

(13.c) 


(14) 


(15) 


(16) 


(17) 


0  0  0  0 

921  <*idn(l /p)  a2dn(l/p)  0 

331  a2dn(l/p)  ctsd^l/p)  0 

<741  942  343  a4dt,{l/p) 

where: 


and: 

“l  =  M[(4/3)»7*  +  q2] 
t*2  =  (m/ ty’lx’lv 
as  =  M  +  (4/3)*7y] 
aA  =  ikPr-'in*  +  » ;2) 

941  =  ai5„(-ti2//>)  +  a26n(-2uv/p)  +  as«,(-wJ/p)  +  a46n(-en/p2)  +  a4^[(«2  +  v2)/p] 

942  =  ~32i  -  <*4Mu/p) 

343  =  -3si  -  a4 Sn(v/p) 

By  means  of  the  local  linearizations  the  equations  (12)  are  then  rewritten: 

[/  +  h(3€£"  +  d„Fn  -  Re-1dnGn)]Aqn  =  hfn  (20) 

with: 

A<T  =  qn+l  -  q"  (21) 

where  the  linear  operator  notation  is  to  be  interpreted  as  following: 

(d(En)Aqn  =  d^(ETlAqn)  (22) 


The  left  hand  side  of  equation  (20)  can  be  factorized  into  a  product  of  two  one-dimensional 
operators,  with  the  same  order  of  accuracy  [lj. 

This  results  in  the  factorized  form  of  equation  (20): 
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([/  +  fcde£*][/  +  h{dnFn  -  Re-ld„Gn)})Aqn  =  hrn. 


(23) 


the  right  hand  side  of  equation  (20)  is  defined  as: 

rn  =  - [d£e"  +  d„fn  -  Re-ld„gn ].  (24) 


Finally,  the  vector  solution  qn+1  is  given  by  the  following  ADI  sequence: 


[/  +  hd£i7*lA?*n  =  hfn 

(25.o) 

[I  +  h(dnFn  -  Re~1dnGn)]Aqn  =  A q*n 

(25.6) 

q"+1  =  9"  +  Af* 

(25  c) 

For  convenience,  we  have  omitted  the  spatial  subscript  notations,  all  the  terms  were  taken  here 
at  the  point  (t,  j)  of  the  computational  domain. 

But  from  now  on,  we  will  use  the  spatial  subscripts  again. 

4.2.  Space  Differencing 

The  resolution  of  the  ADI  sequence  (25.a,25.6, 25. c)  is  completed  by  the  choice  of  the  finite 
difference  operators  S  for  the  spatial  derivatives  d£  and  dn. 

We  use  here  to  approximate  the  convective  derivatives,  the  standard  central-difference  second 
order  accurate  operator.  For  example,  we  have  for  the  first  derivative  of  equation  (25.a): 

SiEij  =  [dEij/dt)  =  (Ei+1J  -  Et-ij)/ 2A£  (26) 

From  the  general  form  of  the  viscous  term  written  as  follows: 

dKaijdfrj/dril/dr,  (27) 

we  see  that  we  have  to  approximate  the  viscous  derivative  in  a  more  complicated  way. 


If,  we  define  the  following  central-difference  operator  for  a  general  viscous  term  h: 


( dhij/dri )  »  (hij+l/2  -  hij-x/t)/ Aq  (28) 

its  application  on  the  viscous  term  defined  in  equation  (27)  gives  the  generalized  three-points 
central-difference  second-order  accurate  operator: 

d[(<*ijd0ij/dri)\/dti  = 

1/(2  x  AqJ)a,J+1  +  ciij)Pij+ 1  -  (ot» j4-i  +  2a ij  +  atJ_i)Aj  -I-  (a,-,,-  +  “»j-i)A,>-i]  (29) 

corresponding  to  the  following  notation  :  Sjfcj  and  S'^Gij 
From  these  definitions,  the  ADI  sequence  is  rewritten  as  follows: 


V+HflKj  -  R 


(30  .a) 
(30.6) 


.n+1 


= 


(30. c) 


with: 


%  =  -[«.%  +  Vo  -*«-**, ffjl 


(31) 


Note  that  by  the  choice  of  a  central-difference  scheme  for  the  space  derivatives,  each  step  of  the 
ADI  sequence  (30)  involves  the  solution  of  a  linear  system  of  equations  having  a  block-tridiagonal 
coefficient  matrix. 


5.  Numerical  results  for  the  serial  code 

The  results  corresponding  to  the  numerical  simulation  of  the  2-D  fluid  flow  in  the  nozzle 
described  in  figure  1  got  on  the  VAX  8600  can  be  found  in  [10]. 


6.  Parallelization  of  the  serial  code  on  the  iPSC  Intel 

6.1.  Introduction 

We  are  dealing  now  with  the  parallelization  of  the  serial  code  on  the  iPSC  Intel.  More  precisely, 
we  want  to  determine  if  such  a  parallel  machine  can  handle  a  CFD  problem  and  with  what  level  of 
performance. 

We  want  to  point  out  that  this  2-D  numerical  simulation  of  a  fluid  flow  for  a  compressible 
viscous  fluid  made  on  the  iPSC  Intel  can  be  viewed  as  a  first  step  for  the  study  of  a  more  general 
flow  of  dimension  3  on  a  distributed  memory  parallel  machine. 

We  have  seen  that  the  numerical  algorithm  based  on  the  Beam  -  Warming  Implicit  method  for 
the  solution  of  the  2-D  Navier-Stokes  equations  leads  to  an  ADI  sequence,  which  has  already  been 
studied  both  theorically  and  pratically  on  the  hypercube  (see  [7,  15]). 

First,  we  recall  the  ADI  sequence  (see  equations  (30)),  that  we  can  express  here  as  follows: 

A'AW  =  bh  fori  =  2,...,  31  (32. a) 

=  A q%  for  .  =  2 . 63  (32.6) 

with  the  updated  vector  solution  q^j1  given  by: 

C'  =  ft  +  A«?i  (**•«) 

for  the  definition  of  A",  ,  C.",  of  size  4x4,  and  6?.-  see  chapter  4. 

The  purpose  of  this  work  is  to  study  the  parallelization  of  the  specific  ADI  sequence  given 
above  (equations  (32))  for  our  computational  domain.  The  mesh  is  of  size  64x32,  where  64  points 
are  used  in  the  horizontal  direction  and  32  points  in  the  vertical  direction,  which  amounts  to  2048 
grids  points,  which  in  turn  correspond  to  8192  unknowns. 

More  precisely,  we  solve  the  30  equations  (32.a)  (respectively  62  equations  (32.b))  with  i 
varying  from  2  to  63  (respectively  with  j  varying  from  2  to  31)  which  lead  to  the  computation  of 
30  linear  systems  of  size  248  (62x4)  (respectively  62  linear  systems  of  size  120  (30x4)). 

Pratically,  we  use  a  one  dimensional  domain  decomposition  embedded  on  the  hypercube  com¬ 
posed  of  l=2d  processors  where  d  is  the  dimension  of  the  hypercube. 


6.2.  Method 


The  work  on  the  hypercube  is  essentially  based  on  a  block  method  approach  that  is  to  say 
that  the  computational  domain  is  alternatively  decomposed  in  horizontal  and  vertical  strips. 

The  idea  is  to  assigned  to  each  processor  a  certain  subset  of  contiguous  rows  and  then  columns, 
the  size  of  the  subset  depending  on  the  dimension  of  the  cube. 

The  solution  of  the  resulting  linear  systems  corresponding  to  each  processor  can  then  be  solve 
by  using  a  sequential  solver. 

With  this  approach  some  data  have  to  be  shared  between  the  processors,  and  for  the  hypercube, 
this  is  done  by  message  passing.  More  precisely,  for  the  solution  of  this  problem  we  use  two  types 
of  communications,  a  nearest  neighbor  communication  scheme  between  two  processors  in  order 
to  form  the  linear  systems  (equations  32  a)  and  a  global  communication  scheme  which  makes  the 
transition  from  the  horizontal  decomposition  to  the  vertical  one  and  vice  versa. 

6.2.1.  Horizontal  decomposition 

First,  the  computational  domain  is  decomposed  in  /  horizontal  strips,  where  each  strip  is  as¬ 
signed  to  one  processor,  see  figure  below  (f=4  i.e.  d=  2). 


Figure  2:  Horizontal  decomposition  for  the  2-d  cube. 


By  a  step  of  nearest  neighbor  communication  for  q  we  can  form  the  matrices  A,  and  the  r.h.s 
b  for  all  the  rows.  This  step  is  done  by  using  a  Gray  code  characterizing  the  different  processors 


and  the  different  blocks  of  the  cube  (see  [13,  14]). 

The  solution  of  the  m  =  32//  linear  systems  of  order  248  are  then  solved  for  each  processor  using 
locally  a  Linpack  solver. 

This  step  corresponds  to  the  first  half  step  of  the  ADI  sequence  (equations  32  a)  i.e.  for  the 
i-direction  of  the  ADI  sequence. 

6.2.2.  Global  communication  of  A q*  and  q 

After  the  solution  of  the  first  step  of  the  ADI  sequence,  chunks  of  data  have  to  be  exchanged 
among  the  processors  in  order  to  assign  j.  contiguous  set  of  columns  to  each  processor. 

This  step  makes  the  transition  between  the  horizontal  and  the  vertical  decompositions  and 
can  be  done  by  different  ways  (see  [8,  14]).  I  have  experimented  here  two  of  them,  the  first  called 
“linear  communication”  consits  of  /  -  1  exchanges  of  blocks  of  data  from  each  processor  to  all  the 
others,  and  in  collaboration  with  Faisal  Saied  a  second  one  called  “log  communication” ,  based  on 
the  representation  of  the  different  blocks  by  binary  numbers  following  a  Gray  code  scheme. 
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Figure  3:  Blocks  of  data  to  be  exchanged  for  the  2-d  cube. 

The  arrows  inside  the  blocks  show  the  destination  of  the  chunks. 

For  the  second  case,  the  global  communication  takes  log2(/)  time  steps,  if  /  is  the  total  number 
of  processors  of  the  hypercube. 

This  scheme  is  explained  in  [ll]  and  is  presented  here  in  a  synthetic  way.  At  each  time  step 
of  this  global  communication  scheme  each  processor  sends  half  of  its  data  to  one  of  its  neighbors 
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in  the  cube.  The  choice  of  the  data  sent  and  how  to  reorder  the  received  data  in  each  processor 
is  function  of  the  binary  number  of  the  corresponding  block.  This  second  method  is  clearly  better 
than  the  first  one  in  term  of  efficiency,  but  also  more  difficult  to  implement. 

6.2.3.  Vertical  decomposition 

The  step  of  global  communication  leads  to  the  decomposition  of  the  computational  domain  in 
/  vertical  strips. 

Without  a  new  step  of  nearest  neighbor  communications,  we  can  form  the  matrices  Cl  ;  for  all  the 
columns,  and  then  we  solve  locally  the  n  =  64//  linear  systems  of  order  120,  corresponding  to  the 
second  half  step  of  the  ADI  sequence  (equations  (32. b)). 

6.2.4.  Update  the  vector  solution 

This  step  consists  of  updating  the  vector  solution  q  for  each  section  of  each  vertical  strip 
(equations  (32.c))  in  each  processor. 

6.2.5.  Global  communications  of  q 

This  step  makes  the  transition  between  the  vertical  and  the  horizontal  decomposition  of  the 
computational  domain  and  is  done  with  the  two  communications  schemes  seen  before. 

More  exactly,  the  vector  solution  q  is  transposed  back  among  the  processors  in  order  to  begin 
a  new  time  step  of  the  ADI  sequence. 


6.3.  Results 

We  have  experimented  our  parallel  code  for  cubes  of  dimensions  from  1  up  to  5,  that  is  to  say 
for  1  up  to  32  processors. 

Two  of  the  main  characteristics  of  a  parallelized  algorithm  are  its  speedup  and  its  efficiency 
(see  figures  4  et  5).  If  Ti  is  the  execution  time  of  the  serial  algorithm,  and  Tp,  the  execution  time 
for  the  same  type  of  algorithm  with  p  processors,  then  the  speedup  is  defined  by  Sp  =  ^  <  p  and 
the  efficiency  by  Ep  =  <  1.  The  algorithm  time  versus  the  communication  time  gives  an  idea 

of  the  proportion  of  the  communication  time  in  the  whole  computation  time.  In  particular,  we 
can  see  in  the  figures  (6)  and  (7)  that  the  linear  scheme  is  better  in  terms  of  communication  time 
when  the  number  of  processors  is  small,  and  the  log  scheme  becomes  better  when  the  number  of 
processors  increases. 

Otherwise,  it  appears  that  the  code  running  with  the  linear  communication  scheme  for  the 


global  communication  gives  some  good  results.  In  particular,  we  get  speedups  of  1.96  for  2  pro¬ 
cessors  and  22.12  for  32  processors,  which  correspond  to  the  following  efficiencies  of  98  %  for  2 
processors  and  69  %  for  32  processors. 

In  the  figure  (6),  we  can  see  that  for  this  first  scheme,  the  communication  time  is  an  increasing 
function  of  l  and  is  at  most  equal  to  20  %  of  the  arithmetic  time  for  32  processors. 

As  expected,  the  code  running  with  the  “log  communication”  scheme  is  better.  The  resulting 
speedups  range  from  1.96  for  2  processors  to  25.22  for  32  processors  which  correspond  for  this  last 
case  to  an  efficiency  of  79  %.  In  figure  (7)  the  communication  time  increases  with  the  number  of 
processors  up  to  /  =8  and  then  decreases,  in  particular  we  found  that  it  decreases  to  14.5  %  of 
the  arithmetic  time  for  32  processors. 


Figure  4:  Speedup  on  the  iPSC  Intel 
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Figure  7:  Arithmetic  and  communication  times  for  the 
log  scheme 

7.  Conclusions: 

The  good  results  got  for  the  simulation  of  a  2-D  fluid  flow  in  a  nozzle  on  the  iPSC  Intel 
Hypercube  show  that  a  CFD  (Computational  Fluid  Dynamics)  problem  based  on  the  solution  of 
an  ADI  sequence  can  be  easily  adapted  on  a  distributed  memory  parallel  machine. 

The  main  reason  is  that  this  problem  can  be  divided  in  as  many  independant  parts  as  needed, 
depending  of  the  number  of  processors  and  this,  without  adding  any  computations.  This  point 
explains  why  we  got  such  good  speedups. 

About  the  communications  among  the  processors,  we  have  seen  that  the  “linear”  algorithm 
gave  some  good  results,  and  is  easy  to  implement.  The  “log”  scheme  is  far  more  difficult  to 
implement  but  brings  some  improvements,  especially  when  the  number  of  processors  increases. 

The  theorical  analysis  of  the  different  running  times  and  of  the  speedups  got  for  this  compu¬ 
tation  is  actually  done  and  will  be  detailed  in  a  next  paper. 
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